Skip to content

ENH: add downcast to pd.to_numeric #13425

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jun 10, 2016

Title is self-explanatory. Closes #13352.

@codecov-io
Copy link

codecov-io commented Jun 11, 2016

Current coverage is 84.31%

Merging #13425 into master will increase coverage by <.01%

@@             master     #13425   diff @@
==========================================
  Files           138        138          
  Lines         51157      51176    +19   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43131      43151    +20   
+ Misses         8026       8025     -1   
  Partials          0          0          

Powered by Codecov. Last updated by 5701c69...4758dcc

@jreback jreback added API Design Numeric Operations Arithmetic, Comparison, and Logical operations labels Jun 11, 2016
@@ -2097,3 +2097,35 @@ def _random_state(state=None):
else:
raise ValueError("random_state must be an integer, a numpy "
"RandomState, or None")


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put in pandas.types.missing.py (new file); let's not add anymore to core/common.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Done.

@gfyoung gfyoung force-pushed the to-numeric-enhance branch from 06612d6 to 052f09e Compare June 11, 2016 13:31
@gfyoung
Copy link
Member Author

gfyoung commented Jun 11, 2016

@jreback : Refactored na_values and added the version tags. Travis is still happy, so ready to merge if there are no other concerns.

@gfyoung gfyoung force-pushed the to-numeric-enhance branch 5 times, most recently from f775312 to 0a503af Compare June 17, 2016 03:57
@jreback jreback added this to the 0.18.2 milestone Jun 17, 2016
@jreback
Copy link
Contributor

jreback commented Jun 17, 2016

lgtm. @jorisvandenbossche

small issue is whethere the kwargs are the right names; further use_unsigned deps on compat_ints which is a little awkward.

maybe:

compat_ints, compat_ints_to_unsigned ? (and decouple them). I know its longer, but its more explicit.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 17, 2016

@jreback : I see what you're saying, but the decoupling would raise the issue of which one takes precedence? i.e. what happens if for some reason or another, someone passes in compact_ints=1 and compact_ints_to_unsigned=1? Also, not sure what I would necessarily say that use_unsigned totally depends on compact_ints - the way I look at it is that it further parameterizes the compacting.

@gfyoung gfyoung force-pushed the to-numeric-enhance branch from 0a503af to 943eaeb Compare June 17, 2016 15:13
@jreback
Copy link
Contributor

jreback commented Jun 17, 2016

@gfyoung that would be an error, both options cannot be True.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 17, 2016

@jreback : hmmm, on the fence regarding this - will wait to see what @jorisvandenbossche says

@gfyoung gfyoung force-pushed the to-numeric-enhance branch from 943eaeb to 1a720ca Compare June 17, 2016 22:48
# compact int64 "NaN" value
int8_na = na_values[np.int8]
int64_na = na_values[np.int64]
s = [int64_na, 2, 3]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the "NaN" like values is the current lib.downcast_int64 behaviour, but I am not sure we should put that in user-facing code?
Because that means you are actually changing the content (and they are currently not used as NaNs)

@jorisvandenbossche
Copy link
Member

At the moment, the implementation in this PR is not really a substitute for the pd.read_csv(... compact_ints=True) I think? As it will only do something when there is actually a conversion from object dtype to integer dtype.

Of course it could be easily changed that the keyword is also used when the dtype is already numeric (the if com.is_numeric_dtype(values): pass at the moment).
But, after second thought, I am actually not sure if to_numeric is the appropriate place for this functionality IMO. As then you would feed integer data to to_numeric, not to "convert it to numeric dtype" (the current purpose of the function), but to downcast them (which broadens the scope of the function). And it feels this should actually belong in a separate function.

@jreback
Copy link
Contributor

jreback commented Jun 18, 2016

@jorisvandenbossche I view to_numeric as basically .astype, with better error handling. So in that vein, I don't think it has to be object to begin with, though maybe we should also add a dtype= option.

alternatively, maybe add these instead to .astype.

and @gfyoung certainly remove the non int64-sentinels though.

@jreback jreback modified the milestones: Next Major Release, 0.18.2 Jun 18, 2016
@gfyoung gfyoung force-pushed the to-numeric-enhance branch 4 times, most recently from 9a01afa to 13b518d Compare June 20, 2016 22:44
@gfyoung
Copy link
Member Author

gfyoung commented Jun 21, 2016

@jorisvandenbossche : that is not correct - downcasting occurs so long as the converted data is of integer dtype, irregardless of its starting dtype i.e. data that is already integral can also be downcasted. This is true in read_csv and in to_numeric as well (with my changes).

While I do see what you are saying that this change could be viewed as expanding the functionality beyond what it was originally intended for, compact_ints and use_unsigned can be viewed as further specifying the manner in which data is casted to numerical values. From this viewpoint, this added functionality remains within the bounds of what this function was originally designed to do.

@jorisvandenbossche
Copy link
Member

@gfyoung Looks good! Made a couple of doc comments

Last comment: I would personally just choose one of 'integer' or 'signed' that we think suits best instead of allowing both for exactly the same

@gfyoung gfyoung force-pushed the to-numeric-enhance branch from 9103251 to bc188c7 Compare July 8, 2016 15:25
@gfyoung
Copy link
Member Author

gfyoung commented Jul 8, 2016

Hmmm...seems reasonable enough @jreback : should we choose one of integer or signed OR just keep both of them to mean "smallest signed integer"?

@gfyoung gfyoung force-pushed the to-numeric-enhance branch from bc188c7 to fdc176d Compare July 9, 2016 19:42
@@ -1788,6 +1788,46 @@ but occasionally has non-dates intermixed and you want to represent as missing.
In addition, :meth:`~DataFrame.convert_objects` will attempt the *soft* conversion of any *object* dtypes, meaning that if all
the objects in a Series are of the same type, the Series will have that dtype.

:meth:`~pandas.to_numeric` is very similar to `pd.to_datetime`, except that it is for conversion to numerical data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can make pd.to_datetime a function reference as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

@gfyoung gfyoung force-pushed the to-numeric-enhance branch 3 times, most recently from 7215b74 to 6538f3d Compare July 9, 2016 21:23
@gfyoung
Copy link
Member Author

gfyoung commented Jul 9, 2016

@jreback: with the massive Travis backlog currently going on here. Could you cancel these two old builds of this PR here and here?

1) :meth:`~pandas.to_datetime` (conversion to datetime objects)

.. ipython:: python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use lists here - no need to use a numpy array

also say these operate on 1dim things (or scalars)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Done.

@gfyoung gfyoung force-pushed the to-numeric-enhance branch from 6538f3d to 4758dcc Compare July 9, 2016 21:54
@gfyoung
Copy link
Member Author

gfyoung commented Jul 9, 2016

@jreback: Another old build to cancel here

@gfyoung
Copy link
Member Author

gfyoung commented Jul 10, 2016

@jreback : Travis is passing, so ready to merge if there are no other concerns.

@jreback jreback closed this in 675a6e3 Jul 10, 2016
@jreback
Copy link
Contributor

jreback commented Jul 10, 2016

thanks @gfyoung another nice PR!

though we DO seem to be having long discussions :)

@jreback
Copy link
Contributor

jreback commented Jul 10, 2016

give a look on the docs once built. I edited slightly.

@gfyoung
Copy link
Member Author

gfyoung commented Jul 10, 2016

@jreback: well, I think the long discussion was quite beneficial. We caught a bug in the Travis build script and made the API slightly more compact (one arg instead of two). Will take a look at the doc.

@gfyoung gfyoung deleted the to-numeric-enhance branch July 10, 2016 21:15
nateGeorge pushed a commit to nateGeorge/pandas that referenced this pull request Aug 15, 2016
Title is self-explanatory.  Closes pandas-dev#13352.

Author: gfyoung <[email protected]>

Closes pandas-dev#13425 from gfyoung/to-numeric-enhance and squashes the following commits:

4758dcc [gfyoung] ENH: add 'downcast' to pd.to_numeric
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants